Pervasive Speech Recognition


Chalapathy  Neti  & Gerasimos  Potamianos 
IBM  T.  J.  Watson  Research  Center 
Yorktown  Heights,  NY  10598 


• A/V speech recognition architecture
• Visual feature extraction
• Audio-visual fusion
• Challenges and conclusions


> Pervasive  deployment  of  speech  will  require  better 
recognition  in  degraded  acoustic  conditions: 

• High noise ("speech babble"), e.g.

- Military applications
- Automobiles
- Video games & interactive television

• Whispered speech

• Privacy

• Lombard speech

• High-noise conditions

• Speech pathology

Audio-Visual  speech  recognition  is  a key  enabler 


Audio-visual  speech  recognition:  architecture 


IBM’s  A/V  speech  effort 

History  (www.research.ibm.com/AVSTG) 

- About a 3-year-old effort.

- Led the JHU Workshop team on A/V speech recognition, 2000

- AVSTG department formed in 2001

- Taught an invited ELSNET tutorial on A/V speech recognition (Prague, 2001).

Highlights/differentiators of our work

- One  of  a kind  database  for  AV  LVCSR 

- State-of-the-art  audio  ASR  subsystem  (LVCSR) 

- Fully  automated  visual  front  end 

* Multiresolution face detection

* Augmented visual speech ROI (jaw region instead of mouth)

* Multistage  (linear  transform  based)  visual  feature  extraction 

- Sub-phonetic  visual  speech  models 

* Scales  to  large-vocabulary  recognition 

- Phone-level  asynchronous  A/V  fusion 

* Joint A/V model training

- Maximum  entropy  based  stream  weight  estimation  (global  and  local) 

- Multiple  domain  exploration 

* Read speech (digits/C&C/LVCSR), impaired speech, automobile, broadcast news
* Visual adaptation to new domains


[Architecture diagram blocks: Region of Interest Extraction; Visual Features (e.g. DCT); Audio Features (e.g. cepstra)]
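
The "Visual Features (e.g. DCT)" block can be illustrated with a minimal sketch, assuming a square grayscale mouth-region image and a simple selection of the lowest-frequency coefficients. The function names, the 16x16 ROI size, and the choice of keeping a 4x4 block are hypothetical illustrations, not the front end described in these slides.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)        # DC row uses a different scale
    return m

def dct_features(roi, keep=4):
    """2-D DCT of a square ROI; return the keep x keep low-frequency block, flattened."""
    roi = np.asarray(roi, dtype=float)
    d = dct_matrix(roi.shape[0])
    coeffs = d @ roi @ d.T
    return coeffs[:keep, :keep].ravel()

# Stand-in for a mouth-region image (assumption: real input would be pixel data).
roi = np.random.default_rng(0).random((16, 16))
features = dct_features(roi)
```

Keeping only low-frequency coefficients is one common way to get a compact feature vector; the slides mention further linear-transform stages that this sketch does not attempt.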
IBM WAV databases


LVCSR 

- First-of-a-kind audio-visual
database for large-vocabulary
continuous SI speech
recognition (LVCSR)

- 290  subjects 

- 70  hrs.  continuous  speech. 

- 10,400 word vocabulary

Digits 

- 50  subjects 

- 8 46 hrs. continuous speech, 11-word
vocabulary

Database  Format 

- Frontal face color video,
704x480, 30 Hz, MPEG2

- 16 kHz, 16-bit PCM


Experiments  on  Digits 
Visually Challenging Domains: Preliminary Results
Audio-Visual Speaker Adaptation

• Important for [text illegible in source]
Conclusions


Consistent  and  significant  gains  for  all  audio  conditions 
Significant performance gains in "speech-babble" noise

- Effective  gain  of  10  dB  @ 10  dB  SNR  for  digits 

- Effective  gains  of  7.5  dB  @ 10  dB  for  LVCSR 

Significant  gains  in  relatively  clean  environments 

- 62% relative gain for digits (0.75 -> 0.28)

- 8%  for  LVCSR 

Super-human  performance  at  high-noise  levels 
Asynchrony  modeling  helps  for  digits 
Further  research  required  in  visually  challenging  domains 
Visual  adaptation  is  a promising  approach 

- Up to 67% relative improvement in visual speech recognition
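
The relative gains quoted above follow the usual relative error reduction formula. A minimal sketch, assuming the digits figures (0.75 and 0.28) are error rates for audio-only and audio-visual recognition respectively; the function name is illustrative only.

```python
def relative_gain(baseline_err, new_err):
    """Relative error reduction, as a fraction of the baseline error."""
    return (baseline_err - new_err) / baseline_err

# Digits figures from the slide: 0.75 -> 0.28
digits_gain = relative_gain(0.75, 0.28)
print(f"{digits_gain:.1%}")
```

The result is a little under 63%, which the slide reports as 62%.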
Real-Time Face Tracker


Three  Types  of  Models  have  been  employed 

• skin-color  model  to  register  the  face 

• motion  model  to  estimate  image  motion 

• camera model to predict and compensate for camera motion (pan, tilt, zoom)


The  Face  Tracker 

• tracks a person's face while the person is
freely moving (e.g. walks, jumps,
sits down and stands up)

• Frame rate: 30+ frames per second
on a low-end workstation (HP9000)
or Pentium II 266 PC.


Interactive Systems Labs


Face and Pose Tracking


What? 

Large  Vocabulary  Speech  Recognition 

- Issues: 

• Sloppy  Speech 

• Distant  Microphones 

• Mismatch  in  Vocabulary 

• Other  Languages 

- Many  Other  Aspects:  Topic  Detection,  Named  Entity,  Translation, 
Discourse,  .... 

Multimodal  Dialog 

- Fuse Speech, Pointing, Gesture, Handwriting

- Fusion  Usually  at  Semantic  Level 

Audio-Visual  Speech 

- Combine  Speech  and  Visual  Info 



From  Read  Speech  to 
Conversational  Speech 


Wall  Street  Journal  Dictation 
Broadcast  News  Database 

- Transcription  and  Information  Retrieval  on  News  Casts 

- Multilingual  Speech  Recognition 

Switchboard  & Callhome 

- Human  to  Human  Telephone  Speech 

Meetings  and  Discussion  Database 

- Newshour  (18h) 

- Crossfire  (9h) 

- Group  Meetings 



Transcribing  Speech  in  Meetings 


Run-On  Transcription  of  Meetings 

• Mismatched  Recording  Conditions 

- Remote  Microphones 

- Cross-Talk 

- Recording  Always  on  ! 

- Noise 

- Multiple  Speakers 

• Mismatched  Speaking  Style: 

- Spontaneous  and  Conversational 
Human  to  Human  Speech 

- Emotional  Speech 

• Mismatched  Language  and  Vocabulary 

- Special Idiosyncratic Topic



Three  Tasks: 

- Newshour 

- Crossfire 

- Group  Meetings 
People Identification: Challenges


How?


Illumination 


Head  pose 


Occlusion 


Detect  Emotional  State 

- Happy, Angry, Sad, Afraid

- Distress, Busy, Relaxed...

Techniques:

- Acoustic: (Polzin, 1999)

• Prosody:  Intensity,  Pitch,  Rhythm. 

• Language:  Words  and  Expressions  Used 

- Visual:  (Cohen) 

• Facial  Expressions 



Audio-Visual  Speech  Data  Set 


• Thanks  to  Intel 

• 78  isolated  words  10  times 

* Date/time/month/day/etc.

* Audio:  44.8  kHz,  16  bits 

* Video:  30/60Hz,  720*640 


• Lip  parameters  extracted 
• Noises

♦ Gaussian white/pink noise,
car, factory (NOISEX-92)

♦ Babble/crosstalk 

♦ Lombard  Effect 
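
Additive noise conditions such as those listed above are commonly simulated by mixing a noise recording into clean speech at a chosen signal-to-noise ratio. A minimal sketch, assuming both signals are NumPy arrays at the same sampling rate and the noise is at least as long as the speech; `mix_at_snr` and the synthetic signals are hypothetical. This does not reproduce the Lombard effect, which changes the speech itself rather than the added noise.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power matches snr_db, then add it."""
    speech = np.asarray(speech, dtype=float)
    # Assumption: noise is at least as long as speech; extra samples are dropped.
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power required for the requested SNR (in dB).
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220.0 * t)     # stand-in for a clean utterance
noise = rng.standard_normal(len(speech))   # stand-in for white noise
noisy, scaled = mix_at_snr(speech, noise, snr_db=10.0)
```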
Result


Strong Lombard Effect

[Plots comparing synchronous and asynchronous models]


Product  Rule  vs.  Sum  Rule 

(For  Speaker  Identification) 

Tsuhan Chen
Product Rule vs. Sum Rule

Product rule fails
dramatically under large
train/test mismatch.

Hybrid  provides  good 
performance  in  all 
contexts. 
Quick Recap


• Asynchronous MI (LI) has more freedom than
synchronous MI (EI) -> better performance

• Product rule is better in the Bayesian sense, but
sum rule is more robust to mismatch

• Robustness to weighting
• Need to be careful about Lombard Effect
• Key to multimodal fusion

* To model dependency between audio and visual
signals

* To dampen independent audio and visual noises
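
The product-rule versus sum-rule contrast in the recap can be made concrete with a minimal sketch, assuming each modality yields per-class posterior probabilities. The function names, the weight `w`, and the toy posteriors are hypothetical; the hybrid rule mentioned on the earlier slide is not specified in the source and is not shown.

```python
import numpy as np

def product_rule(p_audio, p_video, w=0.5):
    """Weighted product (log-linear) fusion of class posteriors, renormalized."""
    log_p = w * np.log(p_audio) + (1.0 - w) * np.log(p_video)
    p = np.exp(log_p - log_p.max())   # subtract max for numerical stability
    return p / p.sum()

def sum_rule(p_audio, p_video, w=0.5):
    """Weighted sum (linear) fusion of class posteriors."""
    return w * np.asarray(p_audio) + (1.0 - w) * np.asarray(p_video)

# Toy case: the audio stream is badly mismatched and nearly vetoes
# the true class (index 0), while the video stream favors it.
p_audio = np.array([0.001, 0.600, 0.399])
p_video = np.array([0.900, 0.050, 0.050])

fused_product = product_rule(p_audio, p_video)
fused_sum = sum_rule(p_audio, p_video)
```

In this toy case the product rule selects class 1 while the sum rule still selects class 0: a near-zero score from one mismatched stream dominates a product but only lowers a sum.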


Beyond  Multimodal  ASR... 
Cleaned Audio

Conclusions

• Database is essential

* Need to consider Lombard Effect

• Fusion is important

* We can learn from acoustic ASR; we can
perhaps lead ASR

• Confidence estimation is important
• Visual channel is not noise-free


Illumination  Variation 


22  illumination  conditions  with  background  light 


Related  Forums 

• IEEE Multimedia Signal Processing (MMSP) Technical
Committee, 1996~

• Proceedings of IEEE, Special Issue on MMSP, 1998
• IEEE MMSP Workshops

* Princeton 1997, Los Angeles 1998, Copenhagen 1999, Cannes 2001, St. Thomas 2002

• IEEE International Conf. on Multimedia and Expo (ICME)

* New York 2000, Tokyo 2001, Lausanne 2002, Baltimore 2003
• IEEE Transactions on Multimedia, March 1999~

* Special issues: networked multimedia 2001, multimedia database 2002, multimodal interface 2003
Please visit us at:

http://amp.ece.cmu.edu

Or,  please  email  me  at 
tsuhan@cmu.edu 

Tsuhan  Chen 


